
Beyond complete genomes: from sequence to structure and function



Eugene V Koonin*, Roman L Tatusov and Michael Y Galperin 






Computer analysis of complete prokaryotic genomes shows 
that microbial proteins are in general highly conserved:
~70% of them contain ancient conserved regions. This allows
us to delineate families of orthologs across a wide 
phylogenetic range and, in many cases, predict protein 
functions with considerable precision. Sequence database 
searches using newly developed, sensitive algorithms result in 
the unification of such orthologous families into larger 
superfamilies sharing common sequence motifs. For many of 
these superfamilies, prediction of the structural fold and 
specific amino acid residues involved in enzymatic catalysis is 
possible. Taken together, sequence and structure comparisons 
provide a powerful methodology that can successfully 
complement traditional experimental approaches. 

Addresses 

National Center for Biotechnology Information, National Library of 
Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA 
* e-mail: koonin@ncbi.nlm.nih.gov 
Correspondence: Eugene V Koonin 

Current Opinion in Structural Biology 1998, 8:355-363
http://biomednet.com/elecref/0959440X00800355
© Current Biology Ltd ISSN 0959-440X
Abbreviations 

COGs clusters of orthologous groups 
HAD haloacid dehalogenase 

Introduction 

The determination of the complete genome sequences 
of several bacteria and archaea and one eukaryote
[1-12] marked the beginning of a new age in biology.
For the first time, we can take a look at the complete
set of proteins present in the cells of each
particular organism and try to identify the proteins 
responsible for each cellular function. In cases where no 
known proteins can be found to perform a particular 
task, the most likely substitutes can be predicted from 
the set of unassigned gene products. Clearly this can be 
done only by analysis of complete genomes, as partial 
sequences do not allow us to ascertain that certain proteins
are not encoded in a given genome [13]. These
new approaches are gradually changing our understand- 
ing of a variety of biological phenomena. As the number 
of sequenced genomes is expected to grow exponential- 
ly for the next few years, their impact on different bio- 
logical disciplines will increase. We have recently 
discussed the implications of the complete genomes for 
microbial evolution [14]. Here we consider the effect of
the genome revolution, together with the improving 
methods for sequence analysis, on our ability to predict 
and understand protein structure and function. 



Towards a natural taxonomy of proteins and
protein families

The numerous genome sequencing projects have resulted 
in a rapid growth of protein databases (see, e.g. [15]). In 
contrast to the pre-genome era, when researchers typically 
chose to clone and sequence genes with documented 
functional roles, we are now getting many protein 
sequences whose functions are not known. This presents 
a challenge to extract the most from these sequences in 
terms of salient features of the encoded proteins, for exam- 
ple to classify them according to their homologous rela- 
tionships, and to predict their possible catalytic activities 
and/or cellular functions, three-dimensional (3D) struc- 
tures and evolutionary origin. 

Protein classifications, pioneered by Dayhoff and her co- 
workers, have historically been based on sequence align- 
ments. Similar proteins formed families, which were 
combined into superfamilies [16]. This approach, continued
in the PIR database [17], proved extremely popular.
However, even PIR superfamilies often unite only closely
related proteins, so more distant relationships are
missed. Other protein databases, such as PROSITE [18],
PRINTS [19], Pfam [20], and ProDom [21], group proteins
on the basis of conserved sequence motifs and, generally,
contain much more diverse protein families.
Structural comparisons of proteins, implemented in FSSP, 
CATH and SCOP databases, offer yet another approach 
to protein classification [22-24]. SCOP superfamilies, for 
example, unite proteins that have some similarities in 
their 3D structures, but often no detectable sequence 
similarity [25]. Thus, in the absence of clear sequence or 
structural similarities, the criteria for inclusion of distantly
related proteins into a family (or superfamily) become
increasingly arbitrary.

With the inception of extensive genome sequencing, it has 
become possible to classify genes and proteins on a different
principle, namely by delineating families of paralogs, that is,
related genes within the same genome [26,27]. Such
analyses have revealed a complex hierarchical organization 
of paralogous families in each of the studied genomes and 
produced at least two generalizations: first, the fraction of 
genes that belong to families of paralogs increases with the
increase of the total number of genes in a genome: from
~25% in the minimal genome of Mycoplasma genitalium to
>50% in the large (for a prokaryote) Escherichia coli genome;
second, the largest superfamilies of paralogs are mostly the
same in all genomes [28-33]. 

Knowledge of all the protein sequences from multiple com- 
plete genomes (Table 1) allows us to redefine the entire 



356 Sequences and topology 



Table 1

Protein families and 3D structures in complete genomes.

                                       Proteins encoded in the genome*   COGs found    3D structures
Species                                Total     Belong to COGs†         (% total)     In PDB   Predicted‡
                                                 number (% total)

Escherichia coli                       4289      2003 (47%)              821 (95%)     240      667
Haemophilus influenzae                 1717      979 (57%)               [illegible]   2        267
Helicobacter pylori                    1566      841 (54%)               617 (72%)     0        169
Synechocystis sp.                      3169      1551 (49%)              796 (93%)     2        431
Borrelia burgdorferi                   850       483 (57%)               363 (42%)     0        105
Bacillus subtilis                      4100      1945 (47%)              732 (85%)     12       578
Mycoplasma genitalium                  467       341 (75%)               290 (34%)     0        75/103
Mycoplasma pneumoniae                  677       378 (56%)               309 (36%)     0        78
Methanococcus jannaschii               1715      830 (48%)               498 (58%)     0        170
Methanobacterium thermoautotrophicum   1869      897 (48%)               484 (56%)     0        199
Archaeoglobus fulgidus                 2407      1131 (47%)              512 (60%)     0        290
Saccharomyces cerevisiae               5932      1736 (29%)              577 (67%)     45       846
Caenorhabditis elegans                 12,178    2172 (18%)              466 (54%)     2        NA

*The numbers are from the latest updates in the GenBank genome division (ftp://ncbi.nlm.nih.gov/genbank/genomes). The C. elegans genome is about
85% complete; the data are from Wormpep12 (www.sanger.ac.uk/Projects/C_elegans/wormpep). †Based on the set of 860 COGs, obtained by
adding H. pylori proteins to the original set of 720 COGs [37••]. ‡The numbers are from the PEDANT database [53•], calculated by comparing the
protein set encoded in each genome to the PDB using FASTA with a cutoff score of 120; the second figure for M. genitalium is from [54•]; the data
for C. elegans are not available.



problem of protein classification. Since the fraction of proteins
conserved over large phylogenetic distances (ancient
conserved domains) appears to be nearly constant at ~70%
in all prokaryotic genomes [34•], it becomes feasible to
replace more or less arbitrary clustering of proteins by similarity
with consistent groups in which the evolutionary relationships
between the members are specifically defined.
Such a classification of proteins can provide a framework for 
evolutionary studies and for rapid, largely automatic, func- 
tional annotation of newly sequenced genomes. 

Several classifications of homologous proteins encoded in
complete genomes have been produced, based on all-against-all
protein sequence comparisons [35,36,37••]. Each
of these projects is aimed at the identification of orthologs,
that is, direct counterparts in different genomes, connected
by an uninterrupted line of vertical descent and typically
retaining their physiological function [26,27]. In particular,
the system of clusters of orthologous groups (COGs) was
designed to accommodate the vastly different evolution
rates observed for different genes [37••]. The COGs construction
procedure identifies the closest homologs in each
of the sequenced genomes for each protein, even if the similarity
is fairly low and not statistically significant by itself.
The approach to the identification of COGs was built upon
the transitivity of orthologous relationships, that is, the simple
notion that any group of at least three genes from distant
genomes, which are more similar to each other than
they are to any other genes from the same genomes, is most
likely to belong to an orthologous family. Clearly, this is a
probabilistic assumption based on a 'weak molecular clock
concept', which posits that orthologs are more similar to
each other than they are to paralogs with different, even if
related, functions. This assumption, however, seems to
hold true in cases where we have reasons to accept orthology
on functional grounds (for example, aminoacyl-tRNA
synthetases or ribosomal proteins). Orthology is not necessarily
a one-to-one relationship: in cases of lineage-specific
duplications, orthology can only be established
between families of paralogous genes. Such complex relationships
require caution in the functional interpretation of
the phylogenetic classification of proteins. Nevertheless,
about 60% of the original set of 720 COGs [37••] are simple
families, with no paralogs or with paralogs from one lineage
only, suggesting the possibility of straightforward transfer of
functional information from functionally characterized
genes from model systems such as E. coli and yeast to those
from poorly characterized genomes.
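
The three-genome criterion described above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the published COG procedure: it assumes that a table of best hits between genomes has already been computed, it requires the best hits to be mutual, and it merges triangles that share two genes. The function names and the data layout are hypothetical.

```python
from itertools import combinations


def mutual_best_hit_triangles(best_hit):
    """Find gene trios from three different genomes that are all mutual best hits.

    best_hit maps (gene, target_genome) -> best-matching gene in target_genome,
    where every gene is a (genome, name) tuple. This layout is an assumption
    made for the illustration.
    """
    genes = sorted({gene for gene, _ in best_hit})

    def mutual(a, b):
        # a's best hit in b's genome is b, and b's best hit in a's genome is a
        return best_hit.get((a, b[0])) == b and best_hit.get((b, a[0])) == a

    return [
        frozenset(trio)
        for trio in combinations(genes, 3)
        if len({g[0] for g in trio}) == 3
        and all(mutual(a, b) for a, b in combinations(trio, 2))
    ]


def merge_triangles(triangles):
    """Merge triangles that share a side (two genes) into larger clusters."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)

    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return list(clusters.values())
```

A gene pair that is a mutual best hit in only two genomes forms no triangle and is therefore left out, which mirrors the requirement for three distant lineages mentioned in the text.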

The utility of this system of protein classification was tested
on several newly sequenced bacterial, archaeal and
eukaryotic genomes. Interestingly, with the only exception
of the minimal genome of M. genitalium, the fraction of the
proteins that belong to the COGs (ancient families conserved
across a wide phylogenetic range) is about the
same and very close to 50% for all prokaryotic genomes
(Table 1). This is clearly compatible with the previous estimate
that about 70% of the proteins encoded in each
genome contain ancient conserved regions. The fraction of
the proteins included in the COGs is at this time lower,
which is evidently due to the requirement for three distant
lineages to be included, and to the limited number of
species in the first instalment of the COGs. There is little
doubt that with new genomes added, the number of COGs
will asymptotically approach the total number of ancient
conserved regions. By contrast, this fraction is much lower


Beyond complete genomes Koonin, Tatusov and Galperin 357 



for eukaryotic genomes, indicating the prevalence of
eukaryote-specific families.

Comparison of the new protein sets with the COGs resulted
in a number of functional predictions for previously
uncharacterized proteins. Even for the Helicobacter pylori
proteins, most of which show highly significant similarity to
homologs from E. coli and other bacteria and have been
described in considerable detail [8••], predictions were made
in more than 100 cases (http://www.ncbi.nlm.nih.gov/COG);
function was also predicted for a number of archaeal and
worm proteins (EV Koonin, RL Tatusov, MY Galperin,
unpublished data).

Missing gene families and evolution of 
metabolic pathways 

Comparative analysis of the available complete genomes 
shows that metabolic diversity generally correlates with 
genome size. Parasitic bacteria import a variety of metabo- 
lites, which allows them to shed genes encoding enzymes 
for many or even most of the metabolic pathways [1-3,
8••,33,38]. In contrast, all cells have to rely on their own
gene products for performing such essential functions as 
genome expression, replication and repair, membrane
biogenesis and others. These tasks alone require at least
about 200 genes [13,37••].

Given complete genome sequences, classification of pro- 
teins into orthologous groups provides a convenient way to 
systematically survey the protein families present or 
absent in a genome and to identify the metabolic pathways 
that are likely to be operative in the organism analyzed. 
When some of the required enzymes cannot be found in
the genome, the respective pathways are either not operative,
or use other, unrelated, proteins to catalyze the missing
steps (see [39]). An example of such an analysis, which
included superposition of the phylogenetic patterns
derived from the COGs [37••] over the scheme of glycolysis,
reveals several interesting trends (Figure 1). Glycolysis
includes three reactions that in different species are catalyzed
by non-orthologous enzymes, namely phosphofructokinases,
aldolases and phosphoglycerate mutases.
Interestingly, the second phosphofructokinase in E. coli,
encoded by the pfkB gene, has apparently been recruited
from a ubiquitous family of ribokinase-like sugar kinases.
The ribokinase COG seems to be an example of a complex
family in which the exact orthologous connections are not
always easy to trace. In particular, even though PfkB formally
belongs to the COG, there seems to be no actual
ortholog of it in other genomes. Thus H. pylori does not
encode a phosphofructokinase at all, although it has genes
for other kinases of the ribokinase family and, accordingly,
is represented in the respective COG (Figure 1).
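
The survey of present and absent families can be sketched in a few lines. The example below is illustrative only: it uses the one-letter species codes from the Figure 1 legend, but the patterns in `EXAMPLE_PATHWAY` are made-up placeholders chosen to resemble the cases discussed in the text, not values transcribed from the figure, and the function names are hypothetical.

```python
# One-letter species codes as given in the Figure 1 legend.
GENOME_CODES = "ehubgplcmtfyw"

# Hypothetical patterns for illustration only; a dash marks a genome in which
# the family has no member. Each step lists the patterns of all alternative
# (possibly non-orthologous) enzyme families able to catalyze it.
EXAMPLE_PATHWAY = [
    ("phosphofructokinase", ["eh-bgplc--yw"]),
    ("triose-phosphate isomerase", ["ehubgplcmtfyw"]),
    ("phosphoglycerate mutase", ["eh-b--l---yw", "e-ubgp-cmtf-w"]),
]


def genomes_present(pattern):
    """Return the set of species codes that occur in a phylogenetic pattern."""
    return {ch for ch in pattern if ch in GENOME_CODES}


def missing_steps(pathway, genome):
    """List pathway steps for which no alternative family is present in `genome`."""
    return [
        step
        for step, patterns in pathway
        if not any(genome in genomes_present(p) for p in patterns)
    ]


if __name__ == "__main__":
    for code in "eum":
        print(code, missing_steps(EXAMPLE_PATHWAY, code))
```

A step reported as missing is only a candidate gap: as the text notes, the pathway may be inoperative or the reaction may be catalyzed by an unrelated protein that the classification does not yet capture.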

A remarkable case of non-orthologous gene displacement
involves two unrelated forms of phosphoglycerate mutase,
the 2,3-bisphosphoglycerate (BPG)-dependent and the
BPG-independent one. While H. influenzae and Borrelia
burgdorferi encode only the BPG-dependent form, and H.
pylori, mycoplasmas, and archaea encode only the BPG-independent
form (see [40]), free-living bacteria such as E.
coli, Bacillus subtilis and Synechocystis sp. possess genes coding
for both these forms, with two paralogs of the BPG-dependent
one (Figure 1). Phosphofructokinase, aldolase
and fructose bisphosphatase genes are all missing in the
archaea (Figure 1), in accordance with the experimental
data [41]. This is consistent with the idea that glycolysis
originally evolved as a biosynthetic pathway, containing
only the lower (tri-carbon) part [42].

Systematic identification of missing links in functional sys- 
tems in organisms for which complete genome sequences 
are available is probably the most important application of 
protein family classification. Conspicuous gaps in the H.
pylori metabolism became apparent from the COG analysis,
suggesting major revisions to the general scheme of the
central metabolic pathways in this bacterium (Table 2). In 
particular, unlike most other bacteria (and all with com- 
pletely sequenced genomes), H. pylori seems to possess 
neither glycolysis nor the pentose phosphate shunt, the 
Entner-Doudoroff pathway being the only major route of 
sugar catabolism. Indeed, sugar fermentation, resulting in 
intracellular acid production, would be an additional bur- 
den on the pH maintenance mechanism in this bacterium, 
which has to survive in an external pH of 2-3. By contrast, 
gluconeogenesis, which converts organic acids into sugars
required for nucleic acid and peptidoglycan biosynthesis
and thus removes H+ from the cytoplasm, appears to be
fully functional in H. pylori. For the purpose of energy production,
H. pylori apparently depends on amino acid fermentation,
which causes alkalinization of the cytoplasm
and thus relieves part of the problem of pH maintenance.
Amino acids and oligopeptides that serve as substrates for 
this fermentation are produced by gastric proteolysis and 
transported by readily identifiable permeases. 

From genomes and families to superfamilies 
and folds 

Classification systems aimed at the identification of fam- 
ilies of orthologs make no attempt to capture the more 
subtle conserved motifs in proteins, which reflect 
ancient relationships at the level of superfamilies and
frequently are critically important for understanding protein
functions and structures [43,44]. Computer methods
for the detection of such motifs and delineation of superfamilies
have lately progressed significantly through programs
such as BLIMPS/MULTIMAT [45], Probe [46],
and PSI-BLAST [47••], which combine pairwise
sequence comparisons with profile analysis. PSI-BLAST,
in particular, has proved to be a powerful tool for the
detection of subtle sequence motifs, resulting in the discovery
of a number of unsuspected superfamily relationships
[47••,48•]. Furthermore, one of the perhaps
under-appreciated benefits of the accumulation of 
genomic sequences is the greatly improved capacity to 
identify even very subtle sequence similarities due to 






Figure 1

[Pathway diagram not reproduced. The scheme runs from glucose-6-phosphate through fructose-6-phosphate, fructose 1,6-bisphosphate, dihydroxyacetone phosphate and glyceraldehyde-3-phosphate, 1,3-bisphosphoglycerate, 3-phosphoglycerate, 2-phosphoglycerate and phosphoenolpyruvate to pyruvate. Legible enzyme labels: pgi (COG0166), fba (COG0191), tpi (COG0149), gapA (COG0057), pgk (COG0126), gpmA (COG0588), yibO (COG0696), gpmB (COG0406), eno (COG0148), pykA and ppsA (COG0574). Each enzyme is annotated with a phylogenetic pattern such as ehubgplcmtfyw.]

Glycolytic enzymes in organisms with completely sequenced genomes. The enzymes are listed under E. coli gene names. The COG numbers are
as in the COG database (www.ncbi.nlm.nih.gov/COG, [37••]) (where available). Shaded arrows indicate reversible reactions, black arrows practically
irreversible ones. Phosphoenolpyruvate synthase-catalyzed reaction in the direction of phosphoenolpyruvate hydrolysis has been demonstrated in
vitro. Phylogenetic patterns are: e, Escherichia coli; h, Haemophilus influenzae; u, Helicobacter pylori; b, Bacillus subtilis; g, Mycoplasma
genitalium; p, Mycoplasma pneumoniae; l, Borrelia burgdorferi; c, Synechocystis sp.; m, Methanococcus jannaschii; t, Methanobacterium
thermoautotrophicum; f, Archaeoglobus fulgidus; y, Saccharomyces cerevisiae; w, Caenorhabditis elegans.



the increasingly uniform population of the protein uni- 
verse by these relatively unbiased sequence sets, of 
which the new methods for sequence analysis mentioned 
above can take advantage [49••].
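
To make the idea of profile analysis concrete, the sketch below builds a simple position-specific scoring matrix from a gap-free alignment block and scores a candidate segment against it. It is a deliberately minimal illustration and not the PSI-BLAST algorithm: it assumes a uniform amino acid background and a flat pseudocount, and it applies no sequence weighting. All names in it are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build log-odds scores per alignment column.

    alignment: list of equal-length strings (one conserved block, no gaps).
    Assumes a uniform background frequency for all 20 amino acids.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(col.get(res, 0.0) for col, res in zip(pssm, segment))


if __name__ == "__main__":
    block = ["DKTG", "DKTG", "DRTG"]  # toy motif block for illustration
    profile = build_pssm(block)
    print(score_segment(profile, "DKTG"), score_segment(profile, "AAAA"))
```

Because the profile pools information from several related sequences, a segment that matches the conserved positions scores well even when its similarity to any single family member is weak, which is the property the iterative methods cited above exploit.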

In the past year, we have seen the identification or significant
extension of a number of protein superfamilies;
some examples, with the distribution among complete
genomes, are shown in Table 3. Most of these superfamilies
are universally found in all genomes, with the counts
more or less proportional to the total number of genes in
the genome. Some expansions are, however, remarkable,
such as, for example, urease-related hydrolases and ATP-grasp
domains in the archaea, and HAD superfamily hydrolases
in E. coli and B. subtilis (Table 3). In certain cases, the
phylogenetic distribution of a superfamily immediately
suggests major evolutionary events. Thus the BRCT
domain is present in a single copy in the DNA ligase of all
bacteria (with one additional copy found only in
Synechocystis), is missing in the archaea, and is dramatically
expanded in its distribution in the eukaryotes (Table 3).
The most obvious interpretation of this distribution is that
this domain has entered the eukaryotic world by horizontal
gene transfer from bacteria and has undergone exten-






Table 2

Genes and pathways missing in Helicobacter pylori.

Enzyme activity                       E. coli gene   COG number    Status in H. pylori

Phosphofructokinase                   pfkA           COG0206       Missing
                                      pfkB           COG0525       Present (ribokinase)
Pyruvate kinase                       pykA           COG0470       Missing
                                      pykF

Implications for H. pylori metabolism: Absence of the two key glycolytic enzymes shows that the
Embden-Meyerhof pathway is not functional in H. pylori. [Start of sentence illegible] fructose
bisphosphatase (HP1385) and phosphoenolpyruvate synthase (HP0121), are present in H. pylori,
allowing it to produce sugars required for peptidoglycan biosynthesis.

6-Phosphogluconate dehydrogenase      gnd            [illegible]   Missing
Ribose 5-phosphate isomerase          rpiA           [illegible]   Missing

Implications for H. pylori metabolism: Pentose phosphate pathway is also not functional. Even
though H. pylori has a ribose 5-phosphate isomerase encoded by an ortholog of the E. coli rpiB,
no gene coding for 6-phosphogluconate dehydrogenase could be identified. The only saccharolytic
pathway in H. pylori appears to be the Entner-Doudoroff pathway.

Lipoate synthase                      lipA           [illegible]   [illegible]
Lipoate-protein ligase                lplA           [illegible]   [illegible]
                                      lipB           COG0319       Missing
Dihydrolipoamide acetyltransferase    aceF                         Missing
Acetate kinase                        ackA                         [illegible]
Phosphotransacetylase                 pta                          Disrupted by frameshifts

Implications for H. pylori metabolism: Pyruvate dehydrogenase complex is absent in H. pylori;
acetate kinase and phosphotransacetylase are not functional. Pyruvate:ferredoxin oxidoreductase
is the only acetyl-CoA-producing enzyme in H. pylori.

Enzymes of purine biosynthesis        purF           COG0034       [illegible]
                                      purD           [illegible]   Inactivated by mutations
                                      purN                         Missing
                                      purT                         Missing
                                      purL_1         COG0046
                                      purL_2         COG0047       Missing
                                      purM           COG0150       Missing
                                      purK           COG0026       Missing
                                      purE           COG0041       Missing
                                      purC           COG0152       Missing
                                      purH           COG0138       Missing
                                      purA           COG0104       Present
                                      purB           COG0015       Present
                                      guaB           COG0516       Present
                                      guaA_1         COG0518       Present
                                      guaA_2         COG0519       Present

Implications for H. pylori metabolism: The de novo purine biosynthesis is absent in H. pylori, and
it has to obtain purines from the host. HP1185 appears to be the best candidate for the purine
permease, as it is the only H. pylori protein similar to E. coli PurP. On the other hand, H. pylori
encodes the enzymes for AMP and GMP synthesis from IMP and their interconversion. Therefore, it
can survive on any of these purines.



sive duplication with divergence in the eukaryotes. The 
expansion of this domain into a number of eukaryotic proteins
involved in cell-cycle control [50••,51] may have
been critical for the very establishment of these systems. 

With the current acceleration in protein structure determi- 
nation [22,24], a superfamily identified by sequence com- 
parison more and more frequently extends to include 
proteins with known 3D structure and/or well-characterized
catalytic mechanism (Table 3). Such findings are
sometimes most illuminating as they immediately result in 
the prediction of the structural fold, the structure of the 
active center, and possibly also the catalytic mechanism for 
a wide variety of diverse proteins comprising the super- 
family. This is illustrated by the recent prediction of the 



structure and the catalytic amino acid residues for P-ATPases,
which remained elusive in spite of a long history
of studies, on the basis of the sequence motifs shared with
haloacid dehalogenases [52•].

Assignment of the gene products to structural folds and fam- 
ilies with maximal attainable precision is arguably one of the 
foremost tasks of genome analysis after the sequencing 
phase. The number of structures that have been determined 
experimentally is negligible for almost all genomes, with the 
exception of E. coli (where it is still rather a small fraction)
(Table 1). A database search with a deliberately conservative
similarity cut-off already increases the fraction of proteins for
which a confident structure prediction is possible to 10-25%
[53•] (Table 1). Secondary structure-based threading allows



Table 3

[Table not legible in the source scan. According to the text, it lists recently identified or extended protein superfamilies, including those with known 3D structures, together with their distribution among the complete genomes.]



another relatively small but notable increase in the predictive
power [54•] (Table 1). It appears, however, that at this time,
the most realistic way to further structure prediction at
genome scale is to perform a complete analysis of protein
superfamilies as exemplified in Table 3.
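
As a worked check of the scale of these predictions, the short sketch below recomputes, from the counts in Table 1, the percentage of each protein set for which a structure was predicted by database search. The numbers are copied from the table (the first M. genitalium figure is used; C. elegans is omitted because its value is not available); whether the "Predicted" column also counts proteins already in the PDB is not stated, so the percentages are indicative only. The function name is hypothetical.

```python
# (total proteins, structures predicted by database search), taken from Table 1.
TABLE1_COUNTS = {
    "Escherichia coli": (4289, 667),
    "Haemophilus influenzae": (1717, 267),
    "Helicobacter pylori": (1566, 169),
    "Synechocystis sp.": (3169, 431),
    "Borrelia burgdorferi": (850, 105),
    "Bacillus subtilis": (4100, 578),
    "Mycoplasma genitalium": (467, 75),
    "Mycoplasma pneumoniae": (677, 78),
    "Methanococcus jannaschii": (1715, 170),
    "Methanobacterium thermoautotrophicum": (1869, 199),
    "Archaeoglobus fulgidus": (2407, 290),
    "Saccharomyces cerevisiae": (5932, 846),
}


def predicted_percent(counts):
    """Return the percentage of proteins with a predicted structure per species."""
    return {
        species: 100.0 * predicted / total
        for species, (total, predicted) in counts.items()
    }


if __name__ == "__main__":
    for species, pct in sorted(predicted_percent(TABLE1_COUNTS).items()):
        print(f"{species:38s} {pct:5.1f}%")
```

With the threading-based figure for M. genitalium (103 of 467 proteins), the same calculation gives about 22%, which illustrates the additional gain attributed to threading in the text.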

Perspective 

As far as prokaryotic genomes are concerned, we have
already entered the post-genomic era. While surprises 
certainly wait ahead, there is little doubt that the major 
protein families are already known or can be deciphered
from the available sequences. We have recently seen 
major progress in methods and procedures for advanced 
sequence analysis, and a lot of valuable information has 
been extracted from the genomes. We believe, however, 
that a major focused effort in genome comparison is still 
required in order to construct a proper classification of 
protein families and superfamilies and systematically
apply it to the goals of structural and functional predic- 
tion. Such an effort will have the potential of creating a 
basis for a rationally designed, decisive onslaught on 
structure determination and experimental identification 
of gene functions using computer predictions as a guide. 
Hopefully, this research program turns out to be both 
realistic and efficient. 

References and recommended reading

Papers of particular interest, published within the annual period of review, 
have been highlighted as: 

• of special interest
•• of outstanding interest

1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF,
Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al.:
Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd. Science 1995, 269:496-512.

2. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA,
Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM et al.:
The minimal gene complement of Mycoplasma genitalium.
Science 1995, 270:397-403.

3. Himmelreich R, Hilbert H, Plagens H, Pirkl E, Li BC, Herrmann R:
Complete sequence analysis of the genome of the bacterium
Mycoplasma pneumoniae. Nucleic Acids Res 1996, 24:4420-4449.

4. Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG,
Blake JA, FitzGerald LM, Clayton RA, Gocayne JD et al.: Complete
genome sequence of the methanogenic archaeon,
Methanococcus jannaschii. Science 1996, 273:1058-1073.

5. Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y,
Miyajima N, Hirosawa M, Sugiura M, Sasamoto S et al.: Sequence
analysis of the genome of the unicellular cyanobacterium
Synechocystis sp. strain PCC6803. II. Sequence determination of
the entire genome and assignment of potential protein-coding
regions. DNA Res 1996, 3:109-136.

6. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H,
Galibert F, Hoheisel JD, Jacq C, Johnston M et al.: Life with 6000
genes. Science 1996, 274:546, 563-567.

7. Blattner FR, Plunkett G 3rd, Bloch CA, Perna NT, Burland V, Riley M,
•• Collado-Vides J, Glasner JD, Rode CK, Mayhew GF et al.: The
complete genome sequence of Escherichia coli K-12. Science
1997, 277:1453-1474.
The completion of the genome sequence of E. coli, one of the classic
objects of molecular biology and genetics, certainly has a symbolic
significance. More importantly, the enormous amount of information
available regarding E. coli gene functions can now be used to full potential
for inferring functions of homologs in other species. However, the functions
of about one half of the E. coli genes have not been determined
experimentally, and so there is still a lot to learn about E. coli itself.



8. Tomb J, White O, Kerlavage A, Clayton R. Sutton G. Fleishmann R, 
Ketchum K, Klenk HP. Gill S. Dougherty BA et a/.: The complete 
genome sequence of the gastric pathogen Helicobacter pylori. 
Nature 1 997 388:539-547. 

The genome sequence of this bacterium is of special interest from several 
points of view. The genome analysis will have important practical 
implications as H. pylori is the causative agent of peptic ulcers and is 
believed to infect up to half of the human population. H. pylori thrives in a 
highly acidic environment (pH 2-3); deciphering the mechanisms of acid 
tolerance from the genome sequence is a most interesting task. 
Furthermore, H. pylori represents an early branching of the proteobacterial 
lineage, and the comparison of its genome with those of other 
Proteobacteria such as E. coli and Haemophilus influenzae will shed light
on the evolution of cellular functions in bacteria and mitochondria. 

9. Smith DR, Doucette-Stamm LA, Deloughery C, Lee H, Dubois J,
•• Aldredge T, Bashirzadeh R, Blakely D, Cook R, Gilbert K et al.:
Complete genome sequence of Methanobacterium
thermoautotrophicum delta H: functional analysis and comparative
genomics. J Bacteriol 1997, 179:7135-7155.
This second genome of a methanogenic archeon to be sequenced, after
Methanococcus jannaschii, is of major importance in corroborating
trends revealed by the M. jannaschii genome analysis [4,34]. As in
M. jannaschii, there is a sharp divide between the majority of the genes,
which appear to have a bacterial origin, and a minority (primarily encoding
proteins involved in genome replication and expression) of 'eukaryotic'
genes. Some other unusual aspects of the M. jannaschii genome, however,
did not recur in M. thermoautotrophicum. For example, unlike
M. jannaschii, M. thermoautotrophicum encodes a typical set of molecular
chaperones such as DnaK and DnaJ and does not encode a unique
ATPase family found in M. jannaschii.

10. Klenk HP, Clayton RA, Tomb JF, White O, Nelson KE, Ketchum KA,
Dodson RJ, Gwinn M, Hickey EK, Peterson JD et al.: The complete
genome sequence of the hyperthermophilic, sulphate-reducing
archaeon Archaeoglobus fulgidus. Nature 1997, 390:364-370.

The first sequence of a non-methanogenic archeon, and the third complete
archeal genome altogether. With 2436 genes, the A. fulgidus genome is
considerably larger than those of M. jannaschii and M. thermoautotrophicum,
in part due to more extensive duplication in some of the gene families.
Unlike M. jannaschii and M. thermoautotrophicum, A. fulgidus does not seem
to encode any inteins. With three genome sequences available, there is for
the first time an opportunity for an informative comparative analysis of
archeal genomes. Definitive work in this area remains to be done, but it is
already clear that the three genomes generally are highly coherent, and also
that there are many mysterious conserved families, creating a challenge for
further research, both theoretical and experimental.

11. Kunst F, Ogasawara N, Moszer I, Albertini AM, Alloni G, Azevedo V,
Bertero MG, Bessieres P, Bolotin A, Borchert S et al.: The complete
genome sequence of the gram-positive bacterium Bacillus
subtilis. Nature 1997, 390:249-256.

The second classic bacterial model, after E. coli, and also the second
largest bacterial genome sequenced so far (4100 genes compared with
4288 genes in E. coli). With B. subtilis adequately representing the
Gram-positive lineage (only the minimal genomes of Mycoplasma had been
available before), we may now have a sampling of the great majority of
bacterial gene families. In addition to its value for comparative analysis,
B. subtilis is most interesting and important in its own right, given, for
example, the large number of genes in its genome that encode enzymes of
secondary metabolite synthesis.

12. Fraser CM, Casjens S, Huang WM, Sutton GG, Clayton R, Lathigra
R, White O, Ketchum KA, Dodson R, Hickey EK et al.: Genomic
sequence of a Lyme disease spirochaete, Borrelia burgdorferi.
Nature 1997, 390:580-586.

The first genome representing yet another major division of bacteria, the 
spirochetes. The genome has a number of unique features, above all a linear 
chromosome unusual in the bacterial world, and at least 17 linear and circular
plasmids that contain about 30% of the genes. Most of the plasmid-borne
genes remain quite mysterious, at least after the initial genome analysis. 

13. Mushegian AR, Koonin EV: A minimal gene set for cellular life
derived by comparison of complete bacterial genomes. Proc Natl
Acad Sci USA 1996, 93:10268-10273.

14. Koonin EV, Galperin MY: Prokaryotic genomes: the emerging
paradigm of genome-based microbiology. Curr Opin Genet Dev
1997, 7:757-763.

15. Benson DA, Boguski MS, Lipman DJ, Ostell J, Ouellette BFF:
GenBank. Nucleic Acids Res 1998, 26:1-7.

16. Dayhoff MO, Barker WC, Hunt LT: Establishing homologies in
protein sequences. Methods Enzymol 1983, 91:524-545.

17. Barker WC, Garavelli JS, Haft DH, Hunt LT, Marzec CR, Orcutt BC,
Srinivasarao GY, Yeh LSL, Ledley RS, Mewes HW et al.: The
PIR-International Protein Sequence Database. Nucleic Acids Res
1998, 26:27-32.

18. Bairoch A, Bucher P, Hofmann K: The PROSITE database, its status
in 1997. Nucleic Acids Res 1997, 25:217-221.

19. Attwood TK, Beck ME, Flower DR, Scordis P, Selley JN: The PRINTS
protein fingerprint database in its fifth year. Nucleic Acids Res
1998, 26:306-311.

20. Sonnhammer ELL, Eddy SR, Birney E, Bateman A, Durbin R: Pfam:
multiple sequence alignments and HMM-profiles of protein 
domains. Nucleic Acids Res 1998, 26:322-325. 

21. Corpet F, Gouzy J, Kahn D: The ProDom database of protein
domain families. Nucleic Acids Res 1998, 26:325-328.

22. Holm L, Sander C: Touring protein fold space with Dali/FSSP.
Nucleic Acids Res 1998, 26:318-321.

23. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM:
CATH — a hierarchic classification of protein domain structures. 
Structure 1997, 5:1093-1108. 

24. Hubbard TJP, Murzin AG, Brenner SE, Chothia C: SCOP: a structural
classification of proteins database. Nucleic Acids Res 1997,
25:236-239.

25. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural
classification of proteins database for the investigation of 
sequences and structures. J Mol Biol 1995, 247:536-540. 

26. Fitch WM: Distinguishing homologous from analogous proteins. 
Syst Zool 1970, 19:99-113.

27. Fitch WM: Uses for evolutionary trees. Phil Trans R Soc Lond B
Biol Sci 1995, 349:93-102.

28. Koonin EV, Tatusov RL, Rudd KE: Sequence similarity analysis of 
Escherichia coli proteins: functional and evolutionary implications.
Proc Natl Acad Sci USA 1995, 92:11921-11925.

29. Labedan B, Riley M: Widespread protein sequence similarities: 
origins of Escherichia coli genes. J Bacteriol 1995, 177:1585-1588.

30. Labedan B, Riley M: Gene products of Escherichia coli: sequence 
comparisons and common ancestries. Mol Biol Evol 1995,
12:980-987.

31. Riley M, Labedan B: Protein evolution viewed through
Escherichia coli protein sequences: introducing the notion of a
structural segment of homology, the module. J Mol Biol 1997,
268:857-868. 

32. Brenner SE, Hubbard T, Murzin A, Chothia C: Gene duplications in 
H. influenzae. Nature 1995, 378:140. 

33. Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS,
Borodovsky M, Rudd KE, Koonin EV: Metabolism and evolution of
Haemophilus influenzae deduced from a whole-genome
comparison with Escherichia coli. Curr Biol 1996, 6:279-291.

34. Koonin EV, Mushegian AR, Galperin MY, Walker DR: Comparison of
• archaeal and bacterial genomes: computer analysis of protein
sequences predicts novel functions and suggests a chimeric
origin for the archaea. Mol Microbiol 1997, 25:619-637.
A detailed comparison of the first available archeal genome (M. jannaschii)
with bacterial genomes produced a number of novel functional predictions
and led to the conclusion that the majority of archeal genes most probably
have a bacterial origin. Furthermore, generalizations started to emerge,
including the nearly constant fraction of genes containing ancient
conserved regions - about 70% in all genomes - and the same major
superfamilies of paralogs.

35. Clayton RA, White O, Ketchum KA, Venter JC: The first genome 
from the third domain of life. Nature 1997, 387:459-462. 

36. Overbeek R, Larsen N, Smith W, Maltsev N, Selkov E: Representation
of function: the next step. Gene 1997, 191:GC1-GC9.

37. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on
•• protein families. Science 1997, 278:631-637.

Comparative analysis of the proteins encoded in seven complete genomes 
from five major phylogenetic lineages and elucidation of consistent patterns 
of sequence similarities resulted in the delineation of 720 clusters of 
orthologous groups (COGs). Each COG consists of individual orthologous 
proteins or orthologous sets of paralogs from at least three lineages. 
Orthologs typically have the same function, allowing transfer of functional 
information from one member to an entire COG. This automatically makes 
possible a number of functional predictions, especially for poorly 
characterized genomes. The evolving system of COGs provides a
framework for functional and evolutionary genome analysis; it is accessible
through the World Wide Web (http://ncbi.nlm.nih.gov/COG).
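
The idea of grouping proteins from at least three lineages by consistent similarity patterns can be illustrated with a small sketch. It assumes, purely for illustration, that "consistent" means mutual best hits between every pair of proteins in a triangle spanning three genomes, and that triangles sharing a side are merged into one cluster. The function name `find_cogs`, the `best_hit` table layout and the toy genome labels are hypothetical; this is not claimed to be the exact procedure of [37].

```python
from itertools import combinations

def find_cogs(best_hit):
    """Group proteins into COG-like clusters (illustrative simplification).

    best_hit maps (protein, other_genome) -> the best-matching protein in
    other_genome, where each protein is a (genome, name) tuple.
    """
    def mutual(a, b):
        # a and b are each other's best hit in the respective genomes
        return best_hit.get((a, b[0])) == b and best_hit.get((b, a[0])) == a

    proteins = sorted({p for p, _ in best_hit})
    # Triangles: three proteins from three different genomes, all pairs mutual
    triangles = [
        frozenset((a, b, c))
        for a, b, c in combinations(proteins, 3)
        if len({a[0], b[0], c[0]}) == 3
        and mutual(a, b) and mutual(b, c) and mutual(a, c)
    ]

    # Union-find over triangles: join two triangles that share a side
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return list(groups.values())
```

A pair of proteins that are mutual best hits in only two genomes forms no triangle and is therefore left out, which mirrors the "at least three lineages" requirement stated above.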

38. Himmelreich R, Plagens H, Hilbert H, Reiner B, Herrmann R:
Comparative analysis of the genomes of the bacteria
Mycoplasma pneumoniae and Mycoplasma genitalium. Nucleic
Acids Res 1997, 25:701-712.

39. Koonin EV, Mushegian AR, Bork P: Non-orthologous gene
displacement. Trends Genet 1996, 12:334-336.

40. Galperin MY, Bairoch A, Koonin EV: A superfamily of 
metalloenzymes unifies phosphopentomutase and cofactor- 
independent phosphoglycerate mutase with alkaline 
phosphatases and sulfatases. Protein Sci 1998, 7:in press.

41. Danson MJ: Central metabolism of the archaea. In The
Biochemistry of Archaea (Archaebacteria). Edited by Kates M,
Kushner DJ, Matheson AT. Amsterdam: Elsevier; 1993:1-24.

42. Romano AH, Conway T: Evolution of carbohydrate metabolic
pathways. Res Microbiol 1996, 147:448-455.

43. Bork P, Koonin EV: Protein sequence motifs. Curr Opin Struct Biol
1996, 6:366-376.

44. Bork P, Gibson TJ: Applying motif and profile searches. Methods 
Enzymol 1996, 266:162-184. 

45. Henikoff S, Henikoff JG: Embedding strategies for effective use of
information from multiple sequence alignments. Protein Sci 1997,
6:698-705. 

46. Neuwald AF, Liu JS, Lipman DJ, Lawrence CE: Extracting protein
alignment models from the sequence database. Nucleic Acids Res
1997, 25:1665-1677.

47. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ: Gapped BLAST and PSI-BLAST - A new generation of
protein database search programs. Nucleic Acids Res 1997,
25:3389-3402.

A major revamp of BLAST, which is definitely the most popular current
method for database search. The key innovations are: first, the program now
makes gapped alignments, with appropriately modified statistics, which
results in a significant increase in sensitivity; and second, the associated
program PSI (Position-Specific Iterated)-BLAST makes a position-specific
weight matrix (profile) out of the first pass results and iterates searches with
this profile until no new sequences with similarity scores above a defined
cut-off are detected. This appears to be the most powerful existing method
for detection of subtle similarities between protein sequences and
delineation of protein superfamilies.
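
The iteration described in this annotation (build a profile from the current hits, search again, stop when nothing new scores above the cut-off) can be sketched as a toy example. This is a deliberately simplified illustration, not the PSI-BLAST implementation: it assumes ungapped sequences of equal length, a uniform background, a fixed raw-score cut-off in place of proper statistics, and the hypothetical names `build_profile`, `score` and `iterative_search`.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_profile(seqs, pseudocount=1.0):
    """Log-odds position-specific weight matrix from equal-length sequences."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background frequency
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def score(profile, seq):
    """Sum of per-position log-odds scores (ungapped, same length)."""
    return sum(col[aa] for col, aa in zip(profile, seq))

def iterative_search(query, database, cutoff, max_rounds=10):
    """Rebuild the profile from all hits so far until no new hit appears."""
    hits = {query}
    for _ in range(max_rounds):
        profile = build_profile(sorted(hits))
        new_hits = {s for s in database
                    if s not in hits and score(profile, s) >= cutoff}
        if not new_hits:
            break  # converged: no new sequences above the cut-off
        hits |= new_hits
    return hits
```

In a toy database, a sequence that differs from the query at half of its positions can fall below the cut-off in the first pass and still be recovered in the second pass, once a closer relative has been folded into the profile. This is the sense in which iteration detects subtle similarities that a single pairwise search misses.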

48. Mushegian AR, Bassett DE Jr, Boguski MS, Bork P, Koonin EV:
• Positionally cloned human disease genes: patterns of
evolutionary conservation and functional motifs. Proc Natl Acad
Sci USA 1997, 94:5831-5836.
Sequence analysis of the proteins encoded by 70 positionally cloned 
human disease genes showed that most of them have orthologs with the 
same domain architecture in the nematode, but domain rearrangements are 
prevalent in yeast and bacterial homologs. This is one of the first 
demonstrations of the utility of PSI-BLAST for the delineation of large 
protein superfamilies. In particular, this method was used for the 
identification of a conserved ATPase domain present in the repair protein 
MutL (one of the colon cancer gene products in humans), histidine kinases, 
molecular chaperones of the HSP90 family and type II DNA 
topoisomerases; the 3D structure for the latter was already available, 
defining the fold for the whole superfamily. 

49. Bork P, Koonin EV: Predicting functions from protein sequences:
where are the bottlenecks? Nature Genet 1998, 18:313-318.

An attempt to analyze the reasons why it is so common that functionally 
and phylogenetically important relationships between sequences are not 
detected in original analysis (particularly in the framework of genome 
projects) but are readily identified in subsequent, more detailed studies. It 
appears that the major bottlenecks include inadequate filtering for noise in
sequence data (for example low-complexity sequences and very common 
domains) and insufficient cross-talk between different types of information. 

50. Bork P, Hofmann K, Bucher P, Neuwald AF, Altschul SF, Koonin EV: A
superfamily of conserved domains in DNA damage-responsive
cell cycle checkpoint proteins. FASEB J 1997, 11:68-76.

A complete description of the BRCT domain that had been originally found
in BRCA1 protein and several other proteins implicated in cell cycle
checkpoints. In this work, the superfamily has been extended to include a
distinct version of the BRCT domain detected in bacterial DNA ligases, the
large subunits of eukaryotic replication factor C, and poly(ADP-ribose)
polymerases. The expansion of the BRCT domain in eukaryotes may be one
of the key events in the evolution of cell-cycle control.






51. Callebaut I, Mornon JP: From BRCA1 to RAP1: a widespread BRCT
module closely associated with DNA repair. FEBS Lett 1997,
400:25-30. 

52. Aravind L, Galperin MY, Koonin EV: The catalytic domain of the
• P-type ATPase has the haloacid dehalogenase fold. Trends Biochem
Sci 1998, 23:127-129.

This paper is an example of the application of sequence profile analysis to 
the prediction of the 3D fold and the catalytic residues in a critically 
important enzyme, P-ATPase, which has defied crystallization attempts and 
remained poorly characterized in spite of intense effort. 

53. Frishman D, Mewes HW: PEDANTic genome analysis. Trends Genet
• 1997, 13:415-416.

This paper describes a very convenient World Wide Web site compiling
results of automatic analysis of all available complete genomes. The Pedant
WWW site (http://pedant.mips.biochem.mpg.de/frishman/pedant.html) is
arguably one of the best entry points to comparative genomics, but it has
to be kept in mind that it is only the first level, crude analysis that is
presented here.

54. Fischer D, Eisenberg D: Assigning folds to the proteins encoded by
• the genome of Mycoplasma genitalium. Proc Natl Acad Sci USA
1997, 94:11929-11934.

One of the first systematic attempts to predict the 3D structures of proteins 
starting from a complete genome. The utility of sequence-structure 
threading is demonstrated but it also becomes clear that such methods at 
best result in a rather small, incremental improvement over state-of-the-art 
sequence comparisons. Although the fraction of the proteins with a
predictable fold is only 22% of the gene products, the authors predict by
extrapolation that it should be possible to assign folds to most soluble 
proteins within a decade. 

55. Holm L, Sander C: An evolutionary treasure: unification of a broad
• set of amidohydrolases related to urease. Proteins 1997, 28:72-82.
A valuable example of a combination of detailed sequence analysis with 
structure-structure comparisons resulting in the characterization of a vast 
protein superfamily. 

56. Stukey J, Carman GM: Identification of a novel phosphatase 
sequence motif. Protein Sci 1997, 6:469-472.

57. Neuwald AF: An unexpected structural relationship between
integral membrane phosphatases and soluble haloperoxidases.
Protein Sci 1997, 6:1764-1767.

58. Galperin MY, Koonin EV: A diverse superfamily of enzymes with 
ATP-dependent carboxylate-amine/thiol ligase activity. Protein Sci
1997, 6:2639-2643.

59. Aravind L, Koonin EV: A novel family of predicted 
phosphoesterases includes Drosophila prune protein and 
bacterial RecJ exonuclease. Trends Biochem Sci 1998, 23:17-19. 

60. Bond CS, Clements PR, Ashby SJ, Collyer CA, Harrop SJ, Hopwood
JJ, Guss JM: Structure of a human lysosomal sulfatase. Structure
1997, 5:277-289.



